International Journal of Electrical and Computer Engineering (IJECE) 
Vol.8, No.6, December 2018, pp. 4352~4355 
ISSN: 2088-8708, DOI: 10.11591/ijece.v8i6.pp4352-4355


A Survey of Arabic Text Classification Models 


Ahed M. F. Al Sbou 


Department of Computer Science, Faculty of Information Technology, Al-Hussein Bin Talal University, Jordan


Article Info

Article history:
Received Nov 25, 2017
Revised Feb 17, 2018
Accepted Mar 2, 2018

Keyword:
Arabic language processing
Arabic text categorization
Arabic text mining
Classification algorithms
Clustering algorithms
Natural language processing
Text classification

ABSTRACT

A huge amount of Arabic text is available online and needs to be organized. As a result, there are many
natural language processing (NLP) applications concerned with text organization. One of them is text
classification (TC). TC makes it easier to deal with unorganized texts by classifying them into suitable
classes or labels. This paper is a survey of Arabic text classification. It also presents a comparison among
different methods for classifying Arabic texts, where Arabic text is regarded as complex due to its
vocabulary. Arabic is one of the richest languages in the world and has many linguistic foundations.
Research on Arabic language processing is scarce compared to English. As a result, these problems pose
challenges for the classification and organization of specific Arabic texts. TC helps to access documents
or information that have already been classified into one or more specific classes or categories. In
addition, classifying documents helps search engines reduce the number of documents to search, which
makes searching and matching against queries easier.


Copyright © 2018 Institute of Advanced Engineering and Science. 
All rights reserved. 


Corresponding Author: 


Ahed M. F. Al Sbou, 

Department of Computer Science, Faculty of Information Technology, 
Al-Hussein Bin Talal University,

Rawdat Al-Amir Rashid, Ma'an, Jordan. 

Email: ahed_alsbou@ahu.edu.jo


1. INTRODUCTION 

The Arabic language is one of the most widely spoken languages, with more than 420 million speakers
around the world. Unlike English, Arabic has no uppercase letters. It also differs from other natural languages
in its use of diacritics, small marks that represent short vowels and related sounds, such as “fatha, kasra,
damma, sukun, shadda, and tanween”. The Arabic orthographic system depends on the effect of diacritics, where
each type of diacritic produces a different word with a different meaning. The language also has specific letters
known as Arabic vowels (waw, yaa, alif) that require a special system of morphology and grammar. Arabic is
further distinguished by its huge vocabulary and number of concepts [1].

Although Arabic texts are viewed as the most difficult ones, there are few studies on the
processing of Arabic texts, for reasons related to the linguistic characteristics of the Arabic language. Given
these strict linguistic characteristics and the limited number of studies on processing Arabic [2], this study
deals with a variety of NLP applications that have recently emerged to handle languages such as
Arabic, English, and Urdu. One of these applications is text classification (TC), which aims to produce a
structured set of documents from unstructured documents. This structured set of texts includes a description of
the content of the documents. TC is the process of classifying textual documents into groups based on subject
similarity or other features [3].

This paper is organized as follows. The present section offers a brief introduction to the topic and
the design of the paper. The second section briefly describes related work in the area of Arabic text
classification. The third section presents the challenges of Arabic text. The fourth section provides a brief
explanation of the common techniques and algorithms used in Arabic text classification. The last section
provides a summary of the paper and suggests future work.


2. RELATED WORK 

Many classification algorithms have been applied to Arabic texts. El-Kourdi et al. applied the
Naive Bayes (NB) algorithm to classify 1500 Arabic text documents into five major categories, and their
results indicated an accuracy of around 68.78% [4]. Sawaf et al. conducted another study that applied
statistical methods to data collected from the Arabic NEWSWIRE corpus. The reported result was
62.7% [5].

El-Halees et al. also classified 300 Arabic documents by applying different algorithms such as the vector
space model (VSM), K-Nearest Neighbor (KNN), and Naive Bayes (NB). The accuracy of the
classification was 74.41% [6]. The same accuracy was obtained in Al-Zoghby's study, which used the
CHARM algorithm to classify Arabic text documents from 5524 records [7].

Mesleh classified Arabic documents using Support Vector Machines (SVMs) with Chi-square feature
selection. He conducted an experimental study on a corpus of 1445 online Arabic documents drawn from
Al-Nahar, Al-hayat, Al-Jazeera, Al-Ahram, and Al-Dostor, which were classified into 9
categories. The F-measure result was 88.11% [8]. Harrag et al. developed an Arabic text classifier using a
hybrid approach in which a tree algorithm selects the features. The data was collected from several Arabic
scientific encyclopedias in many fields. The accuracy was 91% and 93% for the literary and scientific corpora,
respectively [9].
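
The studies above report accuracy and F-measure. As a reminder of what these figures measure, the
following minimal Python sketch computes both for one class. The labels are invented for illustration, the
function names are hypothetical, and the cited studies may use averaged variants of the F-measure that are
not shown here.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of documents whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def f_measure(true_labels, predicted_labels, target):
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)  # true positives
    fp = sum(t != target and p == target for t, p in pairs)  # false positives
    fn = sum(t == target and p != target for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented example labels (assumption, not data from any cited study).
true_labels = ["sport", "sport", "economy", "economy", "sport"]
predicted_labels = ["sport", "economy", "economy", "economy", "sport"]

acc = accuracy(true_labels, predicted_labels)
f_sport = f_measure(true_labels, predicted_labels, "sport")
print(acc, f_sport)
```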


3. ARABIC TEXT CHALLENGES 

Natural languages such as Arabic have been processed through different methods. Arabic,
which has several textual features, requires a specific categorical environment for its morphology,
concepts, and ontology. Arabic is one of the most complex natural languages. It comprises 28
characters [1]. The characters are written in different forms depending on their position in the
word: a character may appear at the beginning, in the middle, or at the end of the word [10]. TC seeks to
group similar documents into specific categories, to assign categories to Arabic texts, and to handle the
related categories produced by other text classifications [11].
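
The positional forms mentioned above can be made concrete with Unicode. The sketch below uses Python's
standard `unicodedata` module to list the four presentation forms of one letter (beh) and to fold them back to
the base letter with NFKC normalization. The choice of letter and the helper name `to_base_letter` are
illustrative assumptions. Most modern text stores only the base letter and leaves the shaping to the renderer,
so this folding matters mainly for text that contains explicit presentation-form code points.

```python
import unicodedata

# Presentation forms of the letter beh (base letter U+0628), one per position.
BEH_FORMS = ["\uFE8F", "\uFE91", "\uFE92", "\uFE90"]

for ch in BEH_FORMS:
    # Prints the code point and its official Unicode name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))


def to_base_letter(text):
    """Fold positional presentation forms back to base Arabic letters (NFKC)."""
    return unicodedata.normalize("NFKC", text)


normalized = [to_base_letter(ch) for ch in BEH_FORMS]
```

After this folding, a classifier sees one token per word regardless of how the letters were shaped.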

Retrieving information from the large amount of Arabic text accessible on the web is
very challenging. Retrieving all documents relevant to a query is also very important to users.
Therefore, using TC to access the different categories makes query processing easier, so that the needed
information can be obtained from them [12]. Further, Arabic texts raise some problematic issues due to the
nature of the language. To the best of my knowledge, studies on the Arabic language are very limited:
there is a lack of Arabic corpora, language tools, and comprehensive studies on preprocessing Arabic texts.
All these problems point to diverse challenges in assigning specific Arabic textual data to a
closed category.


4. TEXT CLASSIFICATION 

TC includes different phases. The first phase is preprocessing the text: removing
punctuation and stop words, and normalizing the text. The second and third phases are classifying the text
and evaluating the classified text [7], [11], [13]. TC is an effective mechanism for managing and organizing
data. It helps the machine access data categories and text labels using a predefined process [14], [15]. This
mechanism can be used to classify a group of documents into kinds of documents using several features such as
content, author, or publisher [16]. The core goal of TC is to convert unstructured text into organized,
structured text that can be used in different NLP applications such as summarization or retrieval [10], [17].
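
The preprocessing phase can be sketched in a few lines of Python. This is a minimal illustration under stated
assumptions, not the pipeline of any cited study: the stop-word list is a tiny invented sample, folding the
alef variants into a bare alef is one common but optional normalization choice, and the names `normalize`
and `preprocess` are hypothetical.

```python
import re

# Arabic diacritics from fathatan to sukun (U+064B to U+0652).
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Tatweel, the elongation character, carries no meaning.
TATWEEL = "\u0640"
# Alef with madda, hamza above, or hamza below.
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
# Anything that is not an Arabic letter or whitespace (punctuation, digits, Latin).
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")
# Illustrative stop words only (assumption); real lists are much longer.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}


def normalize(text):
    """Strip diacritics and tatweel, and fold alef variants into bare alef."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return ALEF_VARIANTS.sub("\u0627", text)


def preprocess(text, stop_words=STOP_WORDS):
    """Return the list of normalized tokens with punctuation and stop words removed."""
    text = NON_ARABIC.sub(" ", normalize(text))
    # Normalize the stop words the same way so that they still match the tokens.
    stops = {normalize(word) for word in stop_words}
    return [token for token in text.split() if token not in stops]


sample = "ذَهَبَ الطالبُ إلى المدرسة، في الصباح!"
tokens = preprocess(sample)
print(tokens)
```

The resulting token list is what the classification phase receives as input.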

Two methods are used in TC: machine learning, in which text is classified using
a set of training documents, and rule-based TC, which relies on the knowledge of experts or engineers to
classify the text [18]. Furthermore, TC can be used in several applications of computer science, such as
spam or e-mail filtering, or as a tool for finding interesting information in particular documents [4], [9].
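
To contrast the two methods, a rule-based classifier can be as simple as a set of expert-written keyword
lists. The sketch below is a hypothetical illustration: the categories, the keywords, and the function name
`rule_based_classify` are assumptions made for this example. A machine learning classifier would instead
derive such associations from training documents.

```python
# Hypothetical expert-written rules: category -> keywords that indicate it.
RULES = {
    "sport": {"كرة", "مباراة", "فريق"},
    "economy": {"سوق", "بنك", "اسعار"},
}


def rule_based_classify(tokens, rules=RULES, default="unknown"):
    """Return the category whose keyword set matches the most tokens."""
    scores = {
        category: sum(token in keywords for token in tokens)
        for category, keywords in rules.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default label when no rule matches at all.
    return best if scores[best] > 0 else default


print(rule_based_classify(["فاز", "فريق", "مباراة"]))
```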


4.1. Common Models and Algorithms of Arabic Text Classification 

Different algorithms are used to classify Arabic documents. In this section, we focus on the
following models: the Naive Bayesian algorithm (NB), the K-Nearest Neighbor algorithm (KNN), the Vector
Space Model (VSM), and the Artificial Neural Network (ANN).
4.1.1. Naïve Bayesian Algorithm 

NB is a machine learning technique used to classify text into predefined categories based on
similar features. NB has been applied to improve the processing and manipulation of texts or information
from different sources. The algorithm is a probabilistic method. The NB classifier
assumes that, within a class, the presence or absence of one feature is unrelated to the presence or absence
of any other feature. NB is commonly used to classify documents because it performs well in classification.
It computes the probability that a document belongs to each class and then assigns the document to the class
with the highest probability [19].
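
The decision rule described above can be illustrated with a small multinomial NB classifier. This is a
sketch under stated assumptions: the training documents are invented, add-one (Laplace) smoothing is one
common choice that the text does not specify, and the names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(documents):
    """documents: list of (tokens, label). Collect the counts NB needs."""
    class_counts = Counter()             # number of documents per class
    word_counts = defaultdict(Counter)   # word frequencies per class
    vocabulary = set()
    for tokens, label in documents:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict_nb(model, tokens):
    """Return the class with the highest (log) probability for the tokens."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one smoothing keeps unseen words from zeroing the probability.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training documents (assumption).
training = [
    (["فريق", "مباراة", "هدف"], "sport"),
    (["مباراة", "لاعب"], "sport"),
    (["بنك", "سوق"], "economy"),
    (["سوق", "اسعار", "بنك"], "economy"),
]
model = train_nb(training)
print(predict_nb(model, ["مباراة", "فريق"]))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long
documents.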

Like many other models, NB has numerous advantages. It is generally considered one of the most powerful
models used in this field. NB is understandable and very simple to implement. As for the disadvantages,
NB needs occurrences of each class in the training data, because it depends on probabilities, and these
probabilities usually depend on frequencies.


4.1.2. K-Nearest Neighbor Algorithms 

K-Nearest Neighbor (KNN) is another machine learning algorithm used for TC. It is a non-parametric
technique that classifies documents or objects based on the closest training examples. It uses a value k
that is always a positive integer; the object is assigned to the class of its closest neighbors. KNN
classifies the object by a majority vote of its neighbors [16]. KNN remains applicable when the training data
is large or very large. Yet KNN has some disadvantages, including the need to define the value of the
parameter k, where k is the number of nearest neighbors. Also, this model is expensive to use in
comparison with other algorithms [20].
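
The majority vote can be sketched as follows. The example assumes that documents have already been turned
into term-frequency vectors and uses Euclidean distance; both are illustrative assumptions (cosine distance
is another common choice), and the data and the name `knn_classify` are invented for this sketch.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training, query, k=3):
    """training: list of (vector, label). Vote among the k closest vectors."""
    # Every prediction measures the distance to every training vector,
    # which is why KNN becomes expensive on large collections.
    neighbors = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Invented term-frequency vectors over a three-term vocabulary (assumption).
training = [
    ([3, 0, 1], "sport"),
    ([2, 1, 0], "sport"),
    ([0, 3, 2], "economy"),
    ([1, 2, 3], "economy"),
    ([0, 2, 2], "economy"),
]
print(knn_classify(training, [2, 0, 1], k=3))
```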


4.1.3. Vector Space Model (VSM) 

The vector space model (VSM) is one of the supervised learning models that have been applied to TC. It
places the different objects and documents in a finite-dimensional space. VSM is also used to analyze data,
texts, and documents in order to compute the similarity among them [21]. VSM has several helpful aspects as an
important model in computer science. First, the method depends on linear algebra and does not
involve any complex algebraic equations [8].

Another advantage is the efficiency of the weights assigned to concepts or terms. This model is also
easy to use in comparison with other methods, and it lets the machine compute the similarity
among documents [22]. However, VSM has some limitations that keep some researchers from using it.
Handling synonyms in Arabic is a major challenge, because Arabic has many synonyms for each word or
concept. Another limitation is that VSM assumes that terms are
statistically independent, while most Arabic terms have a strong relationship with other terms.
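
Both the term weighting and the synonym limitation can be seen in a small example. The sketch below
represents documents as TF-IDF vectors and compares them with cosine similarity. TF-IDF is one common
weighting scheme and is an assumption here, since the text does not name one; the documents and function
names are likewise invented. The first two documents differ only in two synonyms for "house", yet each
synonym occupies its own independent dimension and contributes nothing to the similarity.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """documents: list of token lists. Return (vocabulary, weight vectors)."""
    vocabulary = sorted({token for doc in documents for token in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append(
            [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocabulary]
        )
    return vocabulary, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented documents (assumption): the first two differ only by a synonym.
docs = [
    ["اشترى", "منزل", "جديد"],
    ["اشترى", "بيت", "جديد"],
    ["فريق", "مباراة"],
]
vocabulary, vectors = tf_idf_vectors(docs)
sim_synonyms = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_synonyms, sim_unrelated)
```

The two near-identical documents score well below 1 because only the shared words overlap, which
illustrates why synonym-rich Arabic text is hard for a plain VSM.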


4.1.4. Artificial Neural Network 

ANN is a machine learning model that processes information in a way that resembles the human brain. It
has been applied in different areas of computing such as classification and pattern recognition. It consists
of a set of inputs, adaptive weights, a non-linear function, and outputs [23]. ANNs have several advantages
and disadvantages worth mentioning. An ANN is one of the easier models to use. Also, it is usually
appropriate for complex problems or large texts. As for the disadvantages, an ANN is less
recommended for simpler problems or small texts. The method also requires loading the
training data.
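
The components listed above (inputs, adaptive weights, a non-linear function, and an output) can be shown
with a single artificial neuron. This is a deliberately minimal sketch: real ANNs stack many such units in
layers, the sigmoid function and the gradient-descent update are assumptions chosen for illustration, and the
two-feature training samples are invented.

```python
import math


def sigmoid(z):
    """Non-linear function that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, weights, bias):
    """Weighted sum of the inputs passed through the non-linear function."""
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)


def train(samples, epochs=500, learning_rate=0.5):
    """samples: list of (inputs, target) with target 0 or 1."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = forward(inputs, weights, bias) - target
            # For a sigmoid output with cross-entropy loss, the gradient for
            # each weight is error * input, so the weights adapt as follows.
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


# Invented features: [count of sport terms, count of economy terms]; 1 = sport.
samples = [([2, 0], 1), ([1, 0], 1), ([0, 2], 0), ([0, 1], 0)]
weights, bias = train(samples)
print(forward([3, 0], weights, bias), forward([0, 3], weights, bias))
```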


5. CONCLUSION AND FUTURE WORK 

This paper is a survey of the importance of TC and of the current methods used in the NLP field. We
have discussed the traditional TC models used to classify Arabic texts, corpora,
and documents into different categories. Future work requires more effort to build and develop a new
standard model for Arabic TC. This model must be more efficient than the current traditional methods.
Another important aspect that needs improvement in this model is the handling of language dialects: because
Arabic has different dialects, the model must be compatible with them. It should also be
applicable to any Arabic text.